Draft of article based on discussions about TCP Info data and caveats analyzing it #9
jduckles wants to merge 4 commits into
… about analyzing it
robertodauria
left a comment
Thanks! I've added some comments — see below.
| <!-- TODO: Add direct link to Pavlos' TCPinfo Colab notebook once it has a stable public URL. -->
| <!-- TODO: Add section on unnesting the raw.Snapshots array in BigQuery for within-connection time series analysis. -->
| <!-- FIXME: Verify that the RTT/RTTVar fields cited above match the current ndt.tcpinfo schema exactly — column paths may differ between the ndt.tcpinfo view and raw tables. -->
I would expect the verification to happen before the KB article is posted. Could you please confirm that the TCPInfo schema matches?
| Files are stored in `.zst`-compressed JSONL format. Pavlos Sermpezis has a [Colab notebook](https://colab.research.google.com/) for snapshot-level analysis — ask on the M-Lab Discuss list or Slack for the current link.
| <!-- TODO: Add direct link to Pavlos' TCPinfo Colab notebook once it has a stable public URL. -->
TODOs in code comments aren't very visible. I'd rather either wait until we have a public link to add here (if posting this isn't urgent), or create an issue or a CU task documenting what is missing before merging this PR, perhaps assigned to the person this is blocked on.
Also, AFAIK M-Lab's Slack isn't "public" in the same way the Discuss list is; it's invitation-only.
| Files are stored in `.zst`-compressed JSONL format. Pavlos Sermpezis has a [Colab notebook](https://colab.research.google.com/) for snapshot-level analysis — ask on the M-Lab Discuss list or Slack for the current link.
| <!-- TODO: Add direct link to Pavlos' TCPinfo Colab notebook once it has a stable public URL. -->
| <!-- TODO: Add section on unnesting the raw.Snapshots array in BigQuery for within-connection time series analysis. -->
Same here: either add the section as part of this PR, or create an issue instead of leaving a TODO in a comment.
(This applies to every other TODO in this file.)
| ORDER BY num_snapshots
| ```
| Comparing the two outputs makes the noise problem concrete: the first query will show a large fraction of 1–2 snapshot rows; the second (UUID-joined) query will show a clean distribution concentrated at 40–100 snapshots.
Since we're inviting a comparison here, I think it would be helpful if the two queries used the same date.
Both queries also apply `LIMIT 10000` in the inner query with no `ORDER BY`, which I believe makes the sampled rows non-deterministic. The percentage computed from that sample would then be non-deterministic as well.
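One possible way to make the sample reproducible, sketched here in Python rather than SQL, is to select rows by a hash of the connection UUID instead of an unordered `LIMIT`. The function names, the 1% default, and the choice of SHA-256 are illustrative assumptions, not something taken from the article or from M-Lab tooling.

```python
import hashlib
from typing import Iterable, List


def in_sample(uuid: str, percent: int = 1) -> bool:
    """Return True if this UUID falls in the deterministic `percent`% sample."""
    digest = hashlib.sha256(uuid.encode("utf-8")).digest()
    # Map the hash to a stable bucket in 0..99; the bucket depends only on the UUID.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def deterministic_sample(uuids: Iterable[str], percent: int = 1) -> List[str]:
    """Select the same subset on every run, regardless of input order."""
    return sorted(u for u in uuids if in_sample(u, percent))
```

In BigQuery the same idea is usually written as a filter along the lines of `MOD(ABS(FARM_FINGERPRINT(uuid)), 100) < 1`, where the column name `uuid` is an assumption. Any percentage computed over a sample selected this way is stable between runs.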
| gs://archive-measurement-lab/ndt/tcpinfo/YYYY/MM/DD/
| ```
| Files are stored in `.zst`-compressed JSONL format. Pavlos Sermpezis has a [Colab notebook](https://colab.research.google.com/) for snapshot-level analysis — ask on the M-Lab Discuss list or Slack for the current link.
> Files are stored in `.zst`-compressed JSONL format

This is correct but omits the tarball layer: users will find `.tgz` archives containing per-connection `.jsonl.zst` files.
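To make the two layers concrete, here is a rough Python sketch of walking one archive: open the `.tgz`, pick out the `.jsonl.zst` members, decompress each, and parse it line by line. The function name is hypothetical, and the zstd step is passed in as a callable because zstd decoding is not in the standard library of most Python versions; I have not checked this against real archive contents.

```python
import io
import json
import tarfile
from typing import Callable, Iterator, Tuple


def iter_snapshot_records(
    tgz_bytes: bytes,
    decompress: Callable[[bytes], bytes],
) -> Iterator[Tuple[str, dict]]:
    """Yield (member_name, record) for each JSONL line in each .jsonl.zst member."""
    with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not a per-connection file.
            if not (member.isfile() and member.name.endswith(".jsonl.zst")):
                continue
            extracted = archive.extractfile(member)
            if extracted is None:
                continue
            # Inner layer: zstd-compressed JSONL, one JSON object per line.
            text = decompress(extracted.read()).decode("utf-8")
            for line in text.splitlines():
                if line.strip():
                    yield member.name, json.loads(line)
```

With the third-party `zstandard` package installed, `decompress` could be something like `lambda b: zstandard.ZstdDecompressor().stream_reader(io.BytesIO(b)).read()`.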
Co-authored-by: Roberto D'Auria <roberto@measurementlab.net>
Hey @sermpezis and @robertodauria, could you please review and edit this as you see fit? I pulled it together from the discussion, document, and Slack context using the new KB article Claude skill in this repo, under `.claude/skills/mlab-kb-article`.